GWAS: Techniques and Analysis
Overview of Genomic Techniques

Genome-wide association studies (GWAS) are a well-established method among the several assays in the scientist's toolkit for extracting information from large amounts of DNA. GWAS occupy a special position among genomic techniques because they do not sequence every single nucleotide; instead, they simultaneously probe for differences at many specific genetic sequences, much like in situ hybridization (Wikipedia: "In-Situ Hybridization", http://en.wikipedia.org/wiki/In_situ_hybridization). GWAS are a cost-effective way to locate and describe important single nucleotide polymorphisms (SNPs) in populations.

Complete Sequencing Vs. Genotyping

The primary objective of most GWAS is to associate genotype with phenotype, and to define the effect of certain alleles on that phenotype. Phenotypes include, but are not limited to, physical traits such as hair and eye color. Other phenotypes that can be investigated include various aspects of disease (including mortality), personality traits, and psychological attributes. The first GWAS was a case-control study that found associations between 2 SNPs and age-related macular degeneration (Klein RJ et al., 2005, Complement factor H polymorphism in age-related macular degeneration. Science. http://www.ncbi.nlm.nih.gov/pubmed/15761122).

Procedure and Technology of GWAS

Two major competing technologies, marketed by two companies (Affymetrix and Illumina), are used for SNP genotyping. Both use immobilized oligonucleotides to capture PCR-amplified DNA fragments. Affymetrix uses advanced printing technology to print synthetic DNA sequences on the chip that are complementary to specific genes or to DNA sequences containing the SNP variants of interest. The precise location of each printed sequence is known, and hybridized sequences are detected with a computer-directed laser scanner. There are multiple copies of each printed sequence.
Illumina uses a different approach, with oligonucleotides approximately 20 bp long attached to beads by their 5′ end. As with the Affymetrix technology, the oligonucleotides are redundant: each bead has many copies of the same oligonucleotide, and there are many copies of each bead. The nucleotide at the free 3′ end of the probe lies directly adjacent to the SNP position, because a DNA polymerase can only extend a free 3′ end. After a DNA fragment hybridizes to the probe, labeled dideoxynucleotides are washed over the microarray, and DNA polymerase adds the dideoxynucleotide that is complementary to the exposed base of the captured fragment. The dideoxynucleotides are then detected by computer-guided laser scanning, which picks up the emission spectrum specific to each dideoxynucleotide. The Illumina method is considered more specific than the Affymetrix method because a non-complementary DNA fragment is more likely to hybridize incorrectly to a printed oligonucleotide (Affymetrix) than a non-complementary nucleotide is to be added by the polymerase (Illumina). Both methods use many copies of the same capture probe and measure the proportion of signal from each SNP allele, and use this proportion to determine whether an individual is homozygous or heterozygous for a particular SNP (Bush WS, Moore JH, 2012, Chapter 11: Genome-Wide Association Studies. PLoS Comput Biol, http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1002822).

Illumina Genotyping Video

Data Analysis: Manhattan Plots

The basic structure of a Manhattan plot is shown below. The name of this kind of graph comes from its striking similarity to the skyline of Manhattan. Each point is one single nucleotide polymorphism (SNP) that was tested for by the gene chip, summarizing the data from every sample. On the x axis, the SNPs are arranged by chromosomal location. On the y axis is the negative base-10 logarithm of the P value of the association of that SNP.
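The y-axis transformation can be sketched in a few lines of Python. This is only an illustration: the SNP names and P values are made up, the function name `manhattan_height` is hypothetical, and the cutoff of 5 × 10^-8 is the stringent genome-wide threshold that some have advocated for GWAS.

```python
import math

# Hypothetical P values for illustration only; the SNP names are made up.
p_values = {
    "snp_a": 0.05,
    "snp_b": 1e-5,
    "snp_c": 1e-10,
}

# Assumed cutoff: the stringent genome-wide threshold advocated by some.
GENOME_WIDE_THRESHOLD = 5e-8


def manhattan_height(p):
    """Return -log10(p), the value plotted on the y axis for one SNP."""
    return -math.log10(p)


for name, p in p_values.items():
    height = manhattan_height(p)
    significant = p <= GENOME_WIDE_THRESHOLD
    print(f"{name}: P={p:g}, -log10(P)={height:.2f}, "
          f"passes genome-wide threshold: {significant}")
```

Because the logarithm is negated, smaller P values give taller points, so the strongest associations stand out as the "skyscrapers" of the plot.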
To illustrate what a P value means, without delving too far into statistics, imagine that at one genomic location, out of 20 people with a disease, 19 have an adenine and 1 has a guanine, while among healthy people, 19 have a guanine and 1 has a cytosine. It seems that having an adenine at this location is associated with having the disease, and in this hypothetical example let's say that the P value of this result is 0.05. This means that there is a 5% chance of getting a result at least this extreme if there is really no connection between having an adenine and having the disease. The smaller the P value, the stronger the evidence for an association: if P = 0.03 there is a 3% chance of getting such a result by chance, if P = 0.01 there is a 1% chance, and so on. Often in biology, a P value of less than or equal to 0.05 has been arbitrarily considered significant. In GWAS, however, a much lower threshold of P less than or equal to 0.00000005 (5 × 10^-8) has been advocated by some, because so many SNPs are tested at once that some would pass a 0.05 cutoff by chance alone, and to account for anomalies that can crop up in such large and complex data sets. The greater the sample size, the more power a study has to detect associations.

For reference, the y axis of a Manhattan plot is equal to -log10(P). For this to equal 5, the P value is 0.00001. If the y-axis value is 10, the P value for that association is 0.0000000001. Keep these values in mind when reading the following data.

Example of a Manhattan Plot

Figure: Demonstration of Several GWAS Findings

Specific GWAS Example (Reiner, P, et al., 2013, Soluble CD14: genome-wide association analysis and relationship to cardiovascular risk and mortality in the older adults, Arterioscler Thromb Vasc Biol. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3826541/)

Researchers at the University of Vermont published an analysis of genome-wide associations among SNPs, a soluble form of the inflammatory protein CD14 (sCD14), and mortality in participants in the Cardiovascular Health Study.
The participants in the Cardiovascular Health Study had blood samples taken and have been followed ever since. DNA from each participant was analyzed with a custom Illumina gene chip that had a high density of SNPs in genes associated with cardiovascular disease, including some genes associated with CD14. Some key data from the study are summarized below.

Figure: Concentration of sCD14 and Manhattan Plots for European Americans (N=2,952) and African Americans (N=528)

Top Genome-Wide Findings From Both Manhattan Plots

In European Americans and African Americans, the SNP rs5744441 was associated with a 101 ± 10 ng/mL lower concentration of sCD14 (P = 2.98 × 10^-23). Another association was found in a gene called PIGC, which codes for a protein that is very important in lipid anchor biosynthesis. CD14 is principally a GPI-anchored membrane protein, and it is hypothesized to become soluble for various reasons, one being that it does not receive its post-translational lipid-anchor moiety and is instead secreted directly from the cell. A dysfunctional PIGC protein would theoretically result in increased soluble forms of proteins that are generally attached to the membrane. In the PIGC gene, a C or T polymorphism was identified in both African Americans and European Americans. In European Americans the T allele is the major allele, and in African Americans the C allele was identified as the major allele. The T allele is a missense variant in the PIGC gene. There was a significant association between the T allele and increased sCD14 in European Americans; however, no association was found between sCD14 and the PIGC gene in African Americans (Reiner, P, et al., 2013, Soluble CD14: genome-wide association analysis and relationship to cardiovascular risk and mortality in the older adults, Arterioscler Thromb Vasc Biol. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3826541/).

References: